Lesson 3 Tidy data
… in which we explore the concept of Tidy Data and learn more advanced data wrangling techniques
3.2 Tidy data
3.2.1 What and why is tidy data?

Figure from https://r4ds.had.co.nz/tidy-data.html Wickham and Grolemund4
palmerpenguins::penguins## # A tibble: 344 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen NA NA NA NA
## 5 Adelie Torgersen 36.7 19.3 193 3450
## 6 Adelie Torgersen 39.3 20.6 190 3650
## 7 Adelie Torgersen 38.9 17.8 181 3625
## 8 Adelie Torgersen 39.2 19.6 195 4675
## 9 Adelie Torgersen 34.1 18.1 193 3475
## 10 Adelie Torgersen 42 20.2 190 4250
## # … with 334 more rows, and 2 more variables: sex <fct>, year <int>
3.2.2 Make data tidy
with the tidyr package.
“Happy families are all alike; every unhappy family is unhappy in its own way”
— Leo Tolstoy (https://tidyr.tidyverse.org/articles/tidy-data.html)
Let’s make some data tidy!
table1
table1## # A tibble: 6 × 4
## country year cases population
## <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
3.2.3 pivot_wider
table2## # A tibble: 12 × 4
## country year type count
## <chr> <int> <chr> <int>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583
table2 %>%
pivot_wider(names_from = type, values_from = count)## # A tibble: 6 × 4
## country year cases population
## <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
table2
3.2.4 separate
table3 %>%
separate(col = rate, into = c("cases", "population"), sep = "/")## # A tibble: 6 × 4
## country year cases population
## <chr> <int> <chr> <chr>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
table3
3.2.5 pivot_longer
table4a %>%
pivot_longer(-country, names_to = "year", values_to = "cases")## # A tibble: 6 × 3
## country year cases
## <chr> <chr> <int>
## 1 Afghanistan 1999 745
## 2 Afghanistan 2000 2666
## 3 Brazil 1999 37737
## 4 Brazil 2000 80488
## 5 China 1999 212258
## 6 China 2000 213766
table4a
table4b %>%
pivot_longer(-country, names_to = "year", values_to = "population")## # A tibble: 6 × 3
## country year population
## <chr> <chr> <int>
## 1 Afghanistan 1999 19987071
## 2 Afghanistan 2000 20595360
## 3 Brazil 1999 172006362
## 4 Brazil 2000 174504898
## 5 China 1999 1272915272
## 6 China 2000 1280428583
clean_wide_data <- function(data, values_column) {
data %>%
pivot_longer(-country, names_to = "year", values_to = values_column)
}
clean4a <- table4a %>%
clean_wide_data("cases")
clean4b <- table4b %>%
clean_wide_data("population")
3.2.6 left_join
left_join(clean4a, clean4b, by = c("country", "year"))## # A tibble: 6 × 4
## country year cases population
## <chr> <chr> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
-
table4aandtable4b
3.2.7 unite
table5
table5 %>%
unite("year", century, year, sep = "") %>%
separate(rate, c("cases", "population")) %>%
mutate(
year = parse_number(year),
cases = parse_number(cases),
population = parse_number(population)
)## # A tibble: 6 × 4
## country year cases population
## <chr> <dbl> <dbl> <dbl>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
table5 %>%
unite("year", century, year, sep = "") %>%
separate(rate, c("cases", "population")) %>%
mutate(
across(c(year, cases, population), parse_number)
)## # A tibble: 6 × 4
## country year cases population
## <chr> <dbl> <dbl> <dbl>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
table5 %>%
unite("year", century, year, sep = "") %>%
separate(rate, c("cases", "population")) %>%
mutate(
across(-country, parse_number)
)## # A tibble: 6 × 4
## country year cases population
## <chr> <dbl> <dbl> <dbl>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
3.2.8 Another example
billboard- explicit vs implicit
NAs na.omit
billboard %>%
pivot_longer(starts_with("wk"), names_to = "week", values_to = "placement") %>%
mutate(week = parse_number(week))## # A tibble: 24,092 × 5
## artist track date.entered week placement
## <chr> <chr> <date> <dbl> <dbl>
## 1 2 Pac Baby Don't Cry (Keep... 2000-02-26 1 87
## 2 2 Pac Baby Don't Cry (Keep... 2000-02-26 2 82
## 3 2 Pac Baby Don't Cry (Keep... 2000-02-26 3 72
## 4 2 Pac Baby Don't Cry (Keep... 2000-02-26 4 77
## 5 2 Pac Baby Don't Cry (Keep... 2000-02-26 5 87
## 6 2 Pac Baby Don't Cry (Keep... 2000-02-26 6 94
## 7 2 Pac Baby Don't Cry (Keep... 2000-02-26 7 99
## 8 2 Pac Baby Don't Cry (Keep... 2000-02-26 8 NA
## 9 2 Pac Baby Don't Cry (Keep... 2000-02-26 9 NA
## 10 2 Pac Baby Don't Cry (Keep... 2000-02-26 10 NA
## # … with 24,082 more rows
tidy_bilboard <- billboard %>%
pivot_longer(starts_with("wk"),
names_to = "week",
values_to = "placement",
names_prefix = "wk",
names_transform = list(week = as.integer)
)3.3 More shapes for data
- omitted:
- matrices
- arrays
3.3.1 Lists
c(first = 1, second = 2)## first second
## 1 2
## $first
## [1] 1
##
## $second
## [1] 2
##
## [[3]]
## [1] "some text"
##
## [[4]]
## [[4]][[1]]
## [1] 1
##
## [[4]][[2]]
## [1] 2
##
##
## [[5]]
## [1] 1 2 3 4 5
palmerpenguins::penguins## # A tibble: 344 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen NA NA NA NA
## 5 Adelie Torgersen 36.7 19.3 193 3450
## 6 Adelie Torgersen 39.3 20.6 190 3650
## 7 Adelie Torgersen 38.9 17.8 181 3625
## 8 Adelie Torgersen 39.2 19.6 195 4675
## 9 Adelie Torgersen 34.1 18.1 193 3475
## 10 Adelie Torgersen 42 20.2 190 4250
## # … with 334 more rows, and 2 more variables: sex <fct>, year <int>
3.3.2 Nested data
example <- tibble(
x = 1:3,
y = list(
"hello",
TRUE,
1:4
)
)
example## # A tibble: 3 × 2
## x y
## <int> <list>
## 1 1 <chr [1]>
## 2 2 <lgl [1]>
## 3 3 <int [4]>
# View(example)
nested <- palmerpenguins::penguins %>%
nest(data = -island)
nested## # A tibble: 3 × 2
## island data
## <fct> <list>
## 1 Torgersen <tibble [52 × 7]>
## 2 Biscoe <tibble [168 × 7]>
## 3 Dream <tibble [124 × 7]>
nested$data[[1]]## # A tibble: 52 × 7
## species bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex year
## <fct> <dbl> <dbl> <int> <int> <fct> <int>
## 1 Adelie 39.1 18.7 181 3750 male 2007
## 2 Adelie 39.5 17.4 186 3800 fema… 2007
## 3 Adelie 40.3 18 195 3250 fema… 2007
## 4 Adelie NA NA NA NA <NA> 2007
## 5 Adelie 36.7 19.3 193 3450 fema… 2007
## 6 Adelie 39.3 20.6 190 3650 male 2007
## 7 Adelie 38.9 17.8 181 3625 fema… 2007
## 8 Adelie 39.2 19.6 195 4675 male 2007
## 9 Adelie 34.1 18.1 193 3475 <NA> 2007
## 10 Adelie 42 20.2 190 4250 <NA> 2007
## # … with 42 more rows
nested %>%
unnest(data)## # A tibble: 344 × 8
## island species bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Torgersen Adelie 39.1 18.7 181 3750
## 2 Torgersen Adelie 39.5 17.4 186 3800
## 3 Torgersen Adelie 40.3 18 195 3250
## 4 Torgersen Adelie NA NA NA NA
## 5 Torgersen Adelie 36.7 19.3 193 3450
## 6 Torgersen Adelie 39.3 20.6 190 3650
## 7 Torgersen Adelie 38.9 17.8 181 3625
## 8 Torgersen Adelie 39.2 19.6 195 4675
## 9 Torgersen Adelie 34.1 18.1 193 3475
## 10 Torgersen Adelie 42 20.2 190 4250
## # … with 334 more rows, and 2 more variables: sex <fct>, year <int>
3.5 Resources
- tidyr documentation
- purrr documentation
- stringr documentation for working with text and a helpful cheatsheet for the regular expressions mentioned in the video